Statistics for Data Science II
Recall the general linear model, y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k + \varepsilon
This is a multiple regression model because it has multiple predictors (x_i).
\beta_0 is the y-intercept, or the average outcome (y) when all x_i = 0.
\beta_i is the slope for predictor i and describes the relationship between the predictor and the outcome, after adjusting (or accounting) for the other predictors in the model.
To construct the linear model in R, we use the lm() function (or the glm() function, for reasons that will make sense in the future). Either way, we use the summary() function to obtain information about the model.
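As a sketch of the syntax (assuming a data frame office_ratings with columns imdb_rating, season, and episode is already loaded; the data set itself is not shown here):

```r
# Fit the same model two ways; the coefficient estimates are identical.
m_lm  <- lm(imdb_rating ~ season + episode, data = office_ratings)
m_glm <- glm(imdb_rating ~ season + episode,
             family = "gaussian", data = office_ratings)

summary(m_lm)   # reports R-squared and the overall F-test
summary(m_glm)  # reports deviance and AIC instead
```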
Call:
glm(formula = imdb_rating ~ season + episode, family = "gaussian",
data = office_ratings)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.605835 0.102452 83.999 < 2e-16 ***
season -0.093289 0.015109 -6.174 4.11e-09 ***
episode 0.013616 0.005132 2.653 0.00868 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.238935)
Null deviance: 54.140 on 187 degrees of freedom
Residual deviance: 44.203 on 185 degrees of freedom
AIC: 269.36
Number of Fisher Scoring iterations: 2
Call:
lm(formula = imdb_rating ~ season + episode, data = office_ratings)
Residuals:
Min 1Q Median 3Q Max
-1.43672 -0.29274 -0.03135 0.26734 1.62060
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.605835 0.102452 83.999 < 2e-16 ***
season -0.093289 0.015109 -6.174 4.11e-09 ***
episode 0.013616 0.005132 2.653 0.00868 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4888 on 185 degrees of freedom
Multiple R-squared: 0.1835, Adjusted R-squared: 0.1747
F-statistic: 20.79 on 2 and 185 DF, p-value: 7.147e-09
Thus, the resulting model is \hat{y} = 8.61 - 0.09 x_1 + 0.01 x_2, where
y is the IMDB rating,
x_1 is the season number, and
x_2 is the episode number.
Instead of using y and x_i, we can state it this way: \hat{\text{rating}} = 8.61 - 0.09 \text{ season} + 0.01 \text{ episode}
I prefer using this method when explaining models to collaborators.
For homework/projects, I do not care which method you use.
We want to put the slope into perspective for whoever we are collaborating with.
Basic interpretation: for every 1 [units of x_i] increase in [x_i], [y] [increases or decreases] by \left[ \left| \hat{\beta}_i \right| \right] [units of y].
We can also scale our interpretations (e.g., per 10 episodes instead of per 1 episode).
Note that in the case of multiple regression, there is an unspoken “after adjusting for everything else in the model” at the end of the sentence.
For a 1 season increase in season number, we expect IMDB ratings to decrease by 0.09 points.
For a 1 episode increase in episode number, we expect IMDB ratings to increase by 0.01 points.
The intercept is the average of the outcome when all predictors in the model are equal to 0.
In our example,
The average IMDB rating for season 0, episode 0 is 8.61.
Is this reasonable?
Recall confidence intervals – they allow us to determine how “good” our estimation is.
In general, CIs will take the form
point estimate \pm margin of error
Recall that the standard error accounts for the sample size.
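For a single slope, this takes the usual t-based form (with n observations and k predictors):

```latex
\hat{\beta}_i \pm t_{\alpha/2,\; n-k-1} \cdot \text{SE}\left(\hat{\beta}_i\right)
```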
In R, we will run the model results through the confint() function.
2.5 % 97.5 %
(Intercept) 8.405032659 8.80663733
season -0.122902498 -0.06367481
episode 0.003556247 0.02367506
We have the following 95% CIs: (-0.123, -0.064) for season and (0.004, 0.024) for episode.
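A sketch of the call (assuming m is the model fit earlier with either lm() or glm()):

```r
confint(m)                 # 95% intervals by default
confint(m, level = 0.90)   # change the confidence level via `level`
```

Note that for a glm() fit, confint() profiles the likelihood, so the intervals can differ slightly from the t-based intervals of an lm() fit.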
Let’s review the hypothesis test to determine if the regression line is significant. As with any hypothesis test, we need the hypotheses, the test statistic and p-value, the rejection region at our chosen significance level, and a conclusion/interpretation.
Hypotheses: H_0: \beta_1 = \beta_2 = ... = \beta_k = 0 vs. H_1: at least one \beta_i \ne 0.
Test Statistic and p-Value: these come from the model output of lm() or glm(). 🧐 If we used glm(), summary() does not print the F-statistic, so extra work is needed. If we used lm(), it appears on the last line of the summary() output:
Call:
lm(formula = imdb_rating ~ season + episode, data = office_ratings)
Residuals:
Min 1Q Median 3Q Max
-1.43672 -0.29274 -0.03135 0.26734 1.62060
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.605835 0.102452 83.999 < 2e-16 ***
season -0.093289 0.015109 -6.174 4.11e-09 ***
episode 0.013616 0.005132 2.653 0.00868 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4888 on 185 degrees of freedom
Multiple R-squared: 0.1835, Adjusted R-squared: 0.1747
F-statistic: 20.79 on 2 and 185 DF, p-value: 7.147e-09
Hypotheses: H_0: \beta_1 = \beta_2 = 0 vs. H_1: at least one \beta_i \ne 0.
Test Statistic and p-Value: F_0 = 20.79, p < 0.001.
Rejection Region: reject H_0 if p < \alpha; we will use \alpha = 0.05.
Conclusion/Interpretation: reject H_0. There is sufficient evidence that the regression line is significant.
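For an lm() fit, the F-statistic and its degrees of freedom can also be pulled out programmatically (a sketch, assuming m_lm is the lm() fit from earlier):

```r
fstat <- summary(m_lm)$fstatistic   # named vector: value, numdf, dendf
p_val <- pf(fstat["value"], fstat["numdf"], fstat["dendf"],
            lower.tail = FALSE)     # p-value for the overall F-test
```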
Let’s now review the hypothesis test to determine if the individual slopes are significant.
The pieces we need come from summary(), regardless of whether we used glm() or lm().
Hypotheses: H_0: \beta_i = 0 vs. H_1: \beta_i \ne 0.
Test Statistic and p-Value: read from the coefficients table of the summary() output.
Rejection Region: reject H_0 if p < \alpha.
Call:
glm(formula = imdb_rating ~ season + episode, family = "gaussian",
data = office_ratings)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.605835 0.102452 83.999 < 2e-16 ***
season -0.093289 0.015109 -6.174 4.11e-09 ***
episode 0.013616 0.005132 2.653 0.00868 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.238935)
Null deviance: 54.140 on 187 degrees of freedom
Residual deviance: 44.203 on 185 degrees of freedom
AIC: 269.36
Number of Fisher Scoring iterations: 2
For season:
Hypotheses: H_0: \beta_1 = 0 vs. H_1: \beta_1 \ne 0.
Test Statistic and p-Value: t_0 = -6.174, p < 0.001.
Rejection Region: reject H_0 if p < \alpha = 0.05.
Conclusion/Interpretation: reject H_0. There is sufficient evidence that season number is a significant predictor of IMDB rating.
For episode:
Hypotheses: H_0: \beta_2 = 0 vs. H_1: \beta_2 \ne 0.
Test Statistic and p-Value: t_0 = 2.653, p = 0.009.
Rejection Region: reject H_0 if p < \alpha = 0.05.
Conclusion/Interpretation: reject H_0. There is sufficient evidence that episode number is a significant predictor of IMDB rating.
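The coefficients table itself can be extracted as a matrix, which is handy for reporting (a sketch, assuming m is the fitted model):

```r
tab <- coef(summary(m))     # columns: Estimate, Std. Error, t value, Pr(>|t|)
round(tab[, "t value"], 3)  # the t_0 values for each predictor
```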
Once we have a regression model, we can construct the predicted value (\hat{y}). Recall, \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + ... + \hat{\beta}_k x_k
If we plug in the actual predictor values from the observations in the dataset, we can simplify and find each observation's predicted value.
In R, we will use the predict() function to add the predicted value to our dataset.
Note 1: we can use the predict() function regardless of using lm() or glm() to construct the model.
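A sketch of adding the predictions as a new column (assuming tidyverse is loaded and m is the fitted model; this works cleanly when no model variables are missing):

```r
library(tidyverse)

office_ratings <- office_ratings %>%
  mutate(y_hat = predict(m))   # one predicted value per observation
```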
Note 2: there are issues using predict() if there is missing data in your dataset.
When we model using the glm() and lm() functions, R automatically excludes any observations missing the variables specified in the model (either y or any x_i).
We can exclude missing data on the front end, before fitting the model.
Note that this is not the optimal solution to dealing with missing data.
There are methods specifically to handle missing data (here is one free textbook)
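One front-end exclusion sketch, using tidyr's drop_na() (assuming tidyverse is loaded):

```r
library(tidyverse)

# Keep only rows complete on the model variables, so that predict()
# lines up row-for-row with the data.
office_complete <- office_ratings %>%
  drop_na(imdb_rating, season, episode)
```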
It often helps for us to visualize the model we have constructed.
Recall that while we often refer to it as a “regression line,” it is actually a “response surface”
We must hold all but one predictor constant and let the remaining one (the one on the x-axis) vary, creating predicted values.
This means each plotted line is a version of the predicted value, computed at fixed values of the other predictors.
Unfortunately there is not a function to do this.
In theory you could hard code everything, however, I prefer to save the output of the coefficients() function and call individual pieces (see example).
I personally think season is more interesting, so I will let that one vary.
This means that we need to plug in value(s) for episode.
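The y_hat_ep1, y_hat_ep10, and y_hat_ep20 columns used in the graphs can be built this way (a sketch, assuming m is the fitted model and tidyverse is loaded):

```r
b <- coefficients(m)   # b[1] = intercept, b[2] = season, b[3] = episode

office_ratings <- office_ratings %>%
  mutate(y_hat_ep1  = b[1] + b[2] * season + b[3] * 1,
         y_hat_ep10 = b[1] + b[2] * season + b[3] * 10,
         y_hat_ep20 = b[1] + b[2] * season + b[3] * 20)
```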
office_ratings %>%
ggplot(aes(x = season)) +
geom_point(aes(y = imdb_rating)) +
geom_line(aes(y = y_hat_ep1), color = "blue") +
geom_line(aes(y = y_hat_ep10), color = "purple") +
geom_line(aes(y = y_hat_ep20), color = "darkgreen") +
geom_text(aes(x = 9.75, y = 8.05, label = "Episode 20 ")) +
geom_text(aes(x = 9.75, y = 7.90, label = "Episode 10 ")) +
geom_text(aes(x = 9.75, y = 7.75, label = "Episode 1 ")) +
scale_x_continuous(breaks = office_ratings$season) +
labs(x = "Season", y = "IMDB Rating") +
ylim(6, 10) +
theme_bw()

office_ratings %>%
mutate(finale = if_else((season == 1 & episode == 6) |
(season == 2 & episode == 22) |
(season == 3 & episode == 23) |
(season == 4 & episode == 14) |
(season == 5 & episode == 26) |
(season == 6 & episode == 26) |
(season == 7 & episode == 24) |
(season == 8 & episode == 24) |
(season == 9 & episode == 23), "Yes", "No")) %>%
ggplot(aes(x = season)) +
geom_point(aes(y = imdb_rating, color = as.factor(finale))) +
geom_line(aes(y = y_hat_ep1)) +
geom_line(aes(y = y_hat_ep10)) +
geom_line(aes(y = y_hat_ep20)) +
geom_text(aes(x = 9.75, y = 8.05, label = "Episode 20 ")) +
geom_text(aes(x = 9.75, y = 7.90, label = "Episode 10 ")) +
geom_text(aes(x = 9.75, y = 7.75, label = "Episode 1 ")) +
scale_x_continuous(breaks = office_ratings$season) +
labs(x = "Season", y = "IMDB Rating", color = "Finale") +
ylim(6, 10) +
theme_bw()

Even with a simple example, graphing is still complicated.
In the example:
IMDB ratings must go on the y-axis as it is our outcome.
We selected season to be on the x-axis, however, we could have put episode number on the x-axis.
We graphed a few different episode numbers - we could limit it to one or we could expand it to include more.
You will see that graphing regression results gets really complicated really fast, simply due to the number of predictors in the model.
We can also do fancier things with graphs, like include confidence bands or overlay loess lines.
In class, I demonstrate the base “example” graphs I provide collaborators as a tool for discussion.
These are just starting points and I will edit graphs based on feedback and requests from collaborators.
Today we reminded ourselves about multiple regression. We covered how to:
construct the model in R,
interpret the intercept and slope,
find the confidence interval for \beta_i,
determine if the regression line is significant,
determine if individual predictors are significant,
create predicted values, and
visualize regression results.
Other lectures this week include:
regression assumptions,
regression diagnostics, and
categorical predictors.